About Data Analysis Report

This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Wed Sep 27 00:45:06 2023.

Data Description:

This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.

Bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors.

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.

More information on the dataset can be gotten from this:

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

Loading data and installing packages

## Import required packages
install.packages('stats')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
## Warning: package 'stats' is a base package, and should not be updated
install.packages('forecast')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages('highcharter')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(stats)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(highcharter)
## Loading the datasets
data <- read.csv("hour.csv")
print(head(data))
##   instant     dteday season yr mnth hr holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1  0       0       6          0          1
## 2       2 2011-01-01      1  0    1  1       0       6          0          1
## 3       3 2011-01-01      1  0    1  2       0       6          0          1
## 4       4 2011-01-01      1  0    1  3       0       6          0          1
## 5       5 2011-01-01      1  0    1  4       0       6          0          1
## 6       6 2011-01-01      1  0    1  5       0       6          0          2
##   temp  atemp  hum windspeed casual registered cnt
## 1 0.24 0.2879 0.81    0.0000      3         13  16
## 2 0.22 0.2727 0.80    0.0000      8         32  40
## 3 0.22 0.2727 0.80    0.0000      5         27  32
## 4 0.24 0.2879 0.75    0.0000      3         10  13
## 5 0.24 0.2879 0.75    0.0000      0          1   1
## 6 0.24 0.2576 0.75    0.0896      0          1   1

Data cleaning

## checking quality of data before analysis

# Checking if there are any missing values in the data frame (hour_df and day_df)
cat("Are there any missing data in the dataframe?: ", anyNA(data), "\n")
## Are there any missing data in the dataframe?:  FALSE

Describing and exploring the data

print(summary(data))
##     instant         dteday              season            yr        
##  Min.   :    1   Length:17379       Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   : 8690                      Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379                      Max.   :4.000   Max.   :1.0000  
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0

This Analysis is centered around the variable ‘cnt’ which tells us the total count of registered users per hour(hour_df) or per day(day_df).

The summary statistic for the ‘cnt’ variable shows that;

  • The minimum registered users on a particular day is 1

  • Has an average number of daily registered users of 190.

Exploring seasonal effects on the Number of Registered Users

  • How does Holiday affects the number of Registered users?
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
holiday_day <- data %>% group_by(holiday) %>% summarize(registered_users = sum(cnt))
holiday_day <- holiday_day %>% mutate(holiday = recode(holiday, `0` = "No", `1` = "Yes"))


print(holiday_day)
## # A tibble: 2 × 2
##   holiday registered_users
##   <chr>              <int>
## 1 No               3214244
## 2 Yes                78435

Lets put this in a visualization to better present it

barplot(holiday_day$registered_users, names.arg = holiday_day$holiday, col = "blue", main = "Effects of holiday on number of registered users", xlab = "Holiday", ylab = "Number of registered Users")

The seasonality (Holidays) really affects the number of users that registers. The difference between the Number of Registered Users when there are no holidays and Number of Registered users when there are holidays is very large.

The seasonality of holidays, i.e there are holidays, contributes 2.4% to the number of users registered in that program.

This explains that, during holidays, most percentage of people don’t go out for businesses or work.

  • How does Seasons(spring, summer, fall, winter) affects the number of Registered users?

Now, lets go over the relationship between the seasons of a year and the number of registered users. Does the season have any effect on the number of users registering? well, let’s quench the curiosity by actually looking at how each of them contributes to the number of registered users.

Since the seasons variable contains discrete values, we would be using the barplot to show the relationship.

## grouping the data 
seasons <- data %>% group_by(season) %>% summarize(registered_users = sum(cnt))
seasons <- seasons %>% mutate(season = recode(season, `1` = "spring", `2` = "summer", `3` = "fall", `4` = "winter"))

max_value = max(seasons$registered_users)

barplot(seasons$registered_users, names.arg = seasons$season, col = "purple", main = "Effects of Seasons on number of  registered users", ylim = c(0, max_value+200000), xlab = "seasons", ylab = "Number of registered Users")

During the spring seasons, there are low registrations, and during the fall season, the number of registration is at its peak than any other seasons.

Research tells us that spring season is the period after winter, i.e, it is the transmission period from cold to warm, whereby, fall seasons(also known as autumn) is the transmission period from warm to cold.

This supports our analysis of number of registered users by seasons. At the spring season, the temperature is changing from cold to warm, this indicates that a whole population of people don’t ride bikes on a sunny or hot period, because of health challenges and some other factors. While, looking at the fall season(autumn), it has the highest number of registered users, this is because there is a change in the temperature from warm(hot) to cold.

To put it all together: - Users registers more during a cold season than hot(warm) seasons. - And, seasons affect the number of registered users.

  • Relationship between temperature and the number of registered users?
plot(x = data$cnt, y = data$temp, main='Relationship Between Temp and Users registration', xlab='No of registered users', ylab='temperature in celscius')

The temperature does not neccesarily affect the registration rate of users. The visualization shows that there is no correlation between the number of registered users and temperature of that day.

Creating interactive time series plots

# installing packages
install.packages('timetk')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages('plotly')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages('lubridate')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
#loading packages
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(timetk)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# getting the time series data
data_time = data[c('dteday', 'cnt')]

# Converting the date variable to a date format
data_time$dteday <- as.Date(data_time$dteday)

# Converting the Date column to a time series object using timetk
data_time <- tk_tbl(data_time, date_var = "dteday")
## Warning in tk_tbl.data.frame(data_time, date_var = "dteday"): Warning: No index
## to preserve. Object otherwise converted to tibble successfully.
# Create an interactive time series plot using plotly
plot_ly(data_time, x = ~dteday, y = ~cnt, type = "scatter", mode = "lines") %>%
  layout(title = "Interactive Time Series Plot", xaxis = list(title = "date"), yaxis = list(title = "number of registered users"))

Smoothing time series data

Smoothing time series data involves reducing noise or fluctuations in the data to reveal underlying trends and patterns more clearly.

There are several methods to smooth time series data. Here we would be performing the Holt-Winters exponential smoothing.

install.packages("forecast")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(forecast)

# Perform Holt-Winters smoothing
smoothed_data <- ets(data_time$cnt, model = 'ZZZ')

data_time_smooth <- data_time

data_time_smooth$cnt <- fitted(smoothed_data)

plot(data_time$dteday, data_time$cnt, type = 'l', col = 'red', xlab = 'Date', ylab = 'number of registered users')
lines(data_time$dteday, fitted(smoothed_data), type = 'l', col = 'blue')
legend('topleft', legend = c('Original', 'Smoothed'), col = c('red', 'blue'), lty = 1)

Fitting and forecasting time series data using Autoregressive Integrated Moving Average (ARIMA) models

install.packages("tseries")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(tseries)

# Checking for stationarity using the Augmented Dickey-Fuller test
adf_test <- adf.test(data_time_smooth$cnt)
## Warning in adf.test(data_time_smooth$cnt): p-value smaller than printed p-value
print(adf_test)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data_time_smooth$cnt
## Dickey-Fuller = -9.5484, Lag order = 25, p-value = 0.01
## alternative hypothesis: stationary

We see that the p-value is lower than the significance level of 0.05, thereby rejecting the null hypothesis(not stationary) and accepting the alternative hypothesis(stationary).

Therefore, we conclude that the data is stationary.

# fitting an arima model
arima_model <- auto.arima(data_time_smooth$cnt)

print(arima_model)
## Series: data_time_smooth$cnt 
## ARIMA(2,1,1) 
## 
## Coefficients:
##          ar1      ar2      ma1
##       0.4504  -0.3268  -0.0647
## s.e.  0.0235   0.0095   0.0249
## 
## sigma^2 = 8458:  log likelihood = -103230.3
## AIC=206468.5   AICc=206468.5   BIC=206499.6

The auto.arima() function automatically selects the best ARIMA model order based on AIC (Akaike Information Criterion).

The ARIMA model it selected, ARIMA(2,1,1) indicates:

AR(2): Two autoregressive terms. I(1): The time series has been differenced one times. MA(1): One moving average terms.

#forecasting with the arima model
forecast_values <- forecast(arima_model, h = 31)  # Forecasting 1 month ahead

print(forecast_values, "\n")
##       Point Forecast     Lo 80    Hi 80      Lo 95     Hi 95
## 17380       50.11389  -67.7458 167.9736  -130.1369  230.3647
## 17381       54.68647 -146.7102 256.0831  -253.3231  362.6960
## 17382       60.30438 -188.0116 308.6203  -319.4621  440.0708
## 17383       61.34007 -215.4496 338.1297  -361.9732  484.6534
## 17384       59.97051 -240.8542 360.7952  -400.1012  520.0422
## 17385       59.01525 -266.1098 384.1403  -438.2206  556.2511
## 17386       59.03263 -290.1136 408.1788  -474.9404  593.0057
## 17387       59.35264 -312.2558 430.9610  -508.9734  627.6787
## 17388       59.49108 -332.8785 451.8606  -540.5864  659.5685
## 17389       59.44884 -352.4617 471.3593  -570.5139  689.4116
## 17390       59.38458 -371.2290 489.9982  -599.1821  717.9512
## 17391       59.36944 -389.2415 507.9804  -626.7219  745.4607
## 17392       59.38363 -406.5474 525.3147  -653.1964  771.9637
## 17393       59.39496 -423.2202 542.0101  -678.7012  797.4911
## 17394       59.39543 -439.3342 558.1251  -703.3458  822.1366
## 17395       59.39194 -454.9470 573.7309  -727.2217  846.0055
## 17396       59.39021 -470.1015 588.8819  -750.3974  869.1779
## 17397       59.39058 -484.8337 603.6149  -772.9287  891.7098
## 17398       59.39130 -499.1769 617.9595  -794.8650  913.6476
## 17399       59.39151 -513.1606 631.9437  -816.2514  935.0345
## 17400       59.39137 -526.8111 645.5938  -837.1279  955.9106
## 17401       59.39124 -540.1509 658.9334  -857.5294  976.3118
## 17402       59.39122 -553.2003 671.9828  -877.4867  996.2691
## 17403       59.39126 -565.9775 684.7600  -897.0277 1015.8102
## 17404       59.39128 -578.4987 697.2813  -916.1773 1034.9598
## 17405       59.39128 -590.7789 709.5614  -934.9581 1053.7407
## 17406       59.39127 -602.8313 721.6139  -953.3908 1072.1734
## 17407       59.39127 -614.6684 733.4509  -971.4940 1090.2765
## 17408       59.39127 -626.3011 745.0836  -989.2846 1108.0672
## 17409       59.39127 -637.7397 756.5222 -1006.7785 1125.5610
## 17410       59.39127 -648.9936 767.7761 -1023.9899 1142.7724
cat("Forcasted number of users that will register in the next 1 month: ", sum(forecast_values$mean))
## Forcasted number of users that will register in the next 1 month:  1829.943
plot(forecast_values, main = "Next 1 Month Forecast", ylim = c(0, 2000), xlim = c(16000, dim(data_time_smooth)[1] + 110), xlab = "Day", ylab = "No of registered users")

The forecasts are shown as a blue line, with the 80% prediction intervals as a dark ash shaded area, and the 95% prediction intervals as an ash shaded area.

Findings and Conclusions

  • Seasonality affects the rental demand of bikes

  • The Impacts of each seasons of the year on the rental demand of bikes are as follows:

    • Fall(Autumn) seasons contributes 32.2% to the rental demand of bikes.
    • Summer seasons contributes 27.9% to the rental demand of bikes.
    • Winter seasons contributes 25.6% to the rental demand of bikes.
    • Spring Season contributes the lowest, 14.3%, to the rental demand of bikes
  • Holidays has a negative effect on the rental demand of bikes, there is a decrease of demand during holidays.

  • The Fall(Autumn) season contributes the highest to the rental demands of bikes.

  • There is no effect of temperature to the demand of bikes.

  • The time series model forecasts that there would be a 0.06% increase for the rental demand of bikes within the next 1 month.